Within the last decade, tailored testing has become one of the motivating
forces behind the application of latent trait theory to achievement and ability
measurement. This growing attractiveness of tailored testing is the result of
the problems inherent in conventional paper-pencil testing procedures and the
recent availability of adequate computer technology. In the conventional
testing situation, items of inappropriate difficulty are administered to some of
the examinees. For example, examinees of low ability often receive items that
are too difficult for them, and subsequently they may become frustrated.
Conversely, examinees of high ability levels may receive items that are not
challenging, and as a consequence they may become bored by the testing
procedure. Ideally everyone should receive items appropriate to his or her level
of ability. Conventional tests are most appropriate and most accurate for
examinees of average ability. Therefore the standard error of measurement is
ordinarily higher at the extremes of the ability range than it is at the middle
of the ability range (Koch and Reckase, 1979).

Tailored testing is designed to circumvent these problems by attempting to
administer to each examinee only items of appropriate difficulty. Matching item
difficulty to ability level should reduce the errors of measurement at the
extremes of the ability range, thus reducing one of the problems of conventional
paper-pencil testing. In order for tailored tests to select items of more
appropriate difficulty, the selection of an item is based upon the ability
estimate obtained from the previously administered items. Because of the
advantages accrued by this procedure and the growing availability of computer
technology, tailored testing systems will most certainly proliferate in the
future.


In tailored testing there are two commonly used methods of operation. These
two methods are based on a maximum likelihood ability estimation procedure and
a Bayesian ability estimation procedure (Owen, 1975). The first procedure
estimates a subject's ability after each item using an empirical maximum
likelihood technique. The ability estimate is then used to select the next item
in such a way that the item information is maximized at that ability level
(Birnbaum, 1968). With the second procedure, ability is estimated as the mean of
the posterior ability distribution, and items are selected to minimize the
posterior variance of the ability estimate distribution, while assuming a normal
prior distribution of ability. As these two methods of operation are
substantially different, it is important to examine the quality of results from
these two tailored testing procedures in order to make an educated decision in
choosing which procedure to implement.
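The maximum-information selection rule described above can be illustrated with
a minimal sketch. It assumes a three-parameter logistic item response model
with scaling constant 1.7; the text does not state which model was used, and
the item pool, the function names, and the parameter values below are
hypothetical.

```python
import math

def p_correct(theta, a, b, c):
    """Probability of a correct response under an assumed 3PL model (D = 1.7)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information at ability theta for the assumed 3PL model."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (1.7 * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta_hat, pool, used):
    """Index of the unused item with maximum information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))

# Hypothetical pool of (a, b, c) triples; not the SCAT item parameters.
pool = [(1.0, -1.5, 0.2), (1.2, 0.0, 0.2), (0.9, 1.4, 0.2), (1.1, 2.5, 0.2)]
next_item = select_next_item(1.3, pool, used={1})
print(next_item)
```

With the current ability estimate at 1.3 and the second item already
administered, the rule picks the item whose difficulty lies closest to the
estimate, which is the behavior the maximum likelihood procedure relies on.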

This research will therefore compare the two procedures on the basis of
obtained ability estimates, obtained total test information, and reliability.
However, since tailored tests need not be fixed in length, the first step in
this research is to determine the optimal test length for each procedure. This
step is needed because Reckase (1974) found that continued testing beyond the
point at which the ability estimate stabilizes may introduce bias in the ability
estimate. This is a consequence of the fact that most of the appropriate items
from the item pool for that ability level have been used and only inappropriate
items are available. The determination of the appropriate test length will be
accomplished by using information and posterior variance. Once the test lengths
for the two procedures have been obtained, the ability estimates yielded by the
two procedures at those lengths will be compared. Also, the total test
information yielded and the reliability coefficients yielded by the two
procedures will be compared. It is hoped that by making these comparisons a
clearly preferred procedure will emerge, thus making the selection of a tailored
testing system that much easier.

Instruments 



The major instruments used in this research were the tailored test based
on the maximum likelihood ability estimation procedure and the tailored test
based on the Bayesian ability estimation procedure. The items used in the study
were 137 items from the School and College Ability Test (SCAT), Forms 2A and
3A (Educational Testing Service, 1975). These items measured vocabulary
knowledge using two different item formats, but all items were of the
five-choice multiple-choice form. The tailored tests were administered on an
Amdahl 470/V7 computer via the IBM Time Sharing Option. The subjects received
the items on an ADDS Consul 980 terminal.



Method 



The experiment extended over the winter semester and summer session of
1980, with a different group of subjects each time period. (The summer session
will hereafter be referred to as the summer semester in order to avoid
confusion.) The subjects who participated in the study were graduate and
undergraduate students enrolled in measurement courses at the University of
Missouri-Columbia. During the winter semester the subjects were enrolled in a
graduate/undergraduate course entitled "Group Intelligence Testing" and an
undergraduate course entitled "Introduction to Educational Measurement and
Evaluation". To recruit volunteers for the experiment, students in the classes
were advised at the beginning of each semester that those who volunteered to
participate in the study would receive extra credit towards their course grade.
Each subject was required to participate in two sessions which were one week
apart.

During both the winter and summer semester, the students who volunteered to
participate in the experiment were randomly assigned to either the maximum
likelihood tailored test procedure or the Bayesian tailored test procedure.
During the winter semester there were 19 subjects assigned to the Bayesian
tailored testing procedure and 18 subjects assigned to the maximum likelihood
procedure. Because three subjects failed to complete the second session, only 16
subjects were included in the Bayesian ability estimation procedure and 18 in
the maximum likelihood ability estimation procedure. During the summer semester
there were subjects included in the Bayesian procedure and 23 in the maximum
likelihood procedure. In total there were 70 subjects who completed the
experiment.

Each subject who participated in the experiment received either a test
administered under the maximum likelihood condition for both testing sessions or
a test administered under the Bayesian condition for both sessions. The tests
were administered one week apart for each subject. The two different testing
sessions were started using two different ability estimates, either -.100 or
.150, so that the two different testing sessions would not be identical.

Analyses

The first analysis performed was a comparison of ability estimates from the
two semesters using analysis of variance techniques. Next, a determination of
the optimal test lengths was made by subjectively evaluating plots of the
convergence of the ability estimates. The reliabilities were then computed
across sessions and were compared using chi-square analyses. The total test
information yielded by the two different procedures was compared using analyses
of variance, and then the ability estimates yielded by the two procedures were
compared using a 4-way analysis of variance. Test, session, semester and length
were the independent variables for the analysis.



Results 



Before the data could be analyzed, a determination had to be made whether
the ability estimates from both the winter and summer semesters could be pooled.
Because there were graduate students included in the summer semester group of
subjects, there was some reason to suspect that the mean ability estimates
obtained from the two semesters were different, and possibly the two groups'
data should not be combined. To discern if there were differences in the ability
estimates from the two semesters, a three-way analysis of variance (ANOVA) was
performed on the ability estimates obtained at the 20 item level. Since there
was a potential difference between the ability estimate scales yielded by the
two procedures, the ability estimates obtained within each tailored testing
procedure were converted to T-scores to eliminate the test effect. In the ANOVA
the independent variables were test (maximum likelihood vs. Bayesian), semester
(winter vs. summer), and session, with session being a repeated measure. The
results of this ANOVA are summarized in Table 1. As seen in Table 1, the
semester main effect was significant (F = 7.89, p < .05). Subsequently the
decision was made to analyze the data using semester as an independent variable.
It can be seen in Table 2 that the ability estimates from the summer semester
were greater than the ability estimates from the winter semester for both the
maximum likelihood tailored testing procedure and the Bayesian tailored testing
procedure.
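The T-score conversion used to remove the test effect can be sketched as
follows. This is only an illustration: the ability values are hypothetical, and
the use of the population standard deviation is an assumption, since the text
does not say which form was used.

```python
from statistics import mean, pstdev

def to_t_scores(estimates):
    """Rescale ability estimates to mean 50 and standard deviation 10."""
    m = mean(estimates)
    s = pstdev(estimates)  # assumption: population SD within the group
    return [50.0 + 10.0 * (x - m) / s for x in estimates]

# Hypothetical ability estimates for one procedure; not the study's data.
bayesian_thetas = [0.42, 0.91, 1.30, 0.75, 1.12]
bayesian_t = to_t_scores(bayesian_thetas)
print([round(t, 2) for t in bayesian_t])
```

Applying the conversion separately within each procedure places both sets of
estimates on a common scale, so that any remaining differences reflect semester
or session rather than the scale of the estimation procedure.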



                                  Table 1

           Results of Three-Way ANOVA on the Ability Estimates
                 Yielded at the 20 Item Level With Test,
             Semester, and Session as Independent Variables

Source                            SS       df        MS        F       p

Test                            35.95       1      35.96     0.20    .655
Semester                      1406.51       1    1406.51     7.89    .006
Test x Semester                 35.95       1      35.96     0.20    .655
Error                        11942.65      67     178.25

Session                         50.87       1      50.87     6.63    .012
Test x Session                   9.67       1       9.67     1.26    .265
Semester x Session               2.03       1       2.03     0.26    .603
Test x Semester x Session        0.10       1       0.10     0.01    .910
Error                          513.97      67       7.67







It can also be seen in Table 1 that the session main effect was significant
(F = 6.63, p < .05). From Table 2 it can be seen that the mean ability estimates
were greater in the second session than in the first session for both the
maximum likelihood procedure and the Bayesian procedure.

                                  Table 2

       Mean Ability Estimates in T-Score Form at the 20 Item Level
              for the Winter and Summer Semester and for the
             Maximum Likelihood and Bayesian Tailored Testing
                                 Procedure

                             Winter                    Summer
Test                  Session 1   Session 2     Session 1   Session 2

Maximum Likelihood      46.35       47.24         51.96       52.46
Bayesian                45.76       47.82         53.53       55.00






After the decision was made not to combine the data from the two semesters,
a determination of the optimal test length for the two tailored testing
procedures had to be made. For the maximum likelihood procedure, the values of
the ability estimates obtained after each item and the item information
estimates at the ability estimates were plotted. For the Bayesian procedure, the
values of the ability estimates after each item and the new standard error of
estimate were plotted. A visual evaluation of the plots from both semesters
suggested that the point at which the curves flattened was at the 12 item level
for the maximum likelihood procedure and at the 14 item level for the Bayesian
procedure (see Figure 1 and Figure 2 for examples of these plots). The
flattening of the curves indicated convergence to an ability estimate. Thus the
decision was made to analyze the data from both semesters at these levels as
well as the 20 item level.
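The test lengths in this study were chosen by visual inspection of the plots.
Purely as an illustration of what "flattening" means, the sketch below applies
a simple numerical rule to a sequence of ability estimates. The rule, the
tolerance, and the sequence of estimates are all hypothetical and were not part
of the study.

```python
def convergence_point(estimates, tol=0.05):
    """Smallest item count after which every successive change is below tol.

    estimates[i] is the ability estimate after item i + 1. Returns None if
    the sequence never settles within the tolerance.
    """
    for k in range(len(estimates) - 1):
        later_changes = (
            abs(estimates[j] - estimates[j - 1])
            for j in range(k + 1, len(estimates))
        )
        if all(change < tol for change in later_changes):
            return k + 1
    return None

# Hypothetical sequence of ability estimates after each administered item.
trace = [0.50, 0.90, 1.20, 1.35, 1.38, 1.40, 1.41]
print(convergence_point(trace))
```

A rule of this kind is sensitive to the tolerance chosen, which is one reason a
subjective evaluation of the plots may be preferred with small samples.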


After making this decision, the next analysis to be performed was the
comparison of the reliabilities. The reliabilities for each test were computed
across sessions at the 12, 14, and 20 item levels within each semester. The
reliabilities were computed for both ability estimates and estimated true scores
(Lord, 1968) and are shown in Table 3. The first comparison was a chi-square on
the estimated true score reliabilities in order to determine if the
reliabilities were estimates of the same correlation (Snedecor and Cochran,
1980). It was not significant. The second comparison was on the reliabilities
for the ability estimates. It also was not significant. Although it appears as
if there is not a significant difference between the reliabilities of the two
different testing procedures, nor between the various test lengths, it must be
remembered that these reliabilities were obtained using relatively small samples
and thus it would take a large difference to be significant.
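A test of whether several correlations estimate the same population value is
commonly carried out with Fisher's z transformation. The sketch below shows
that standard form of the test; it is assumed, not confirmed by the text, that
this is the computation that was used, and the correlations and sample sizes in
the example are hypothetical.

```python
import math

def correlation_homogeneity_chi2(rs, ns):
    """Chi-square statistic and degrees of freedom for equality of correlations.

    Each r is transformed to Fisher's z and weighted by n - 3; the statistic is
    the weighted sum of squared deviations from the weighted mean z.
    """
    zs = [math.atanh(r) for r in rs]
    ws = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    chi2 = sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))
    return chi2, len(rs) - 1

# Hypothetical reliabilities and group sizes; not the values in Table 3.
chi2, df = correlation_homogeneity_chi2([0.90, 0.85, 0.93], [16, 18, 23])
print(round(chi2, 3), df)
```

Because each weight is n - 3, groups of 15 to 25 subjects contribute little
weight, which is consistent with the remark that only a large difference
between reliabilities would reach significance.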

                                  Table 3

        Bayesian vs. Maximum Likelihood Tailored Test Reliabilities
                 for Winter and Summer Using Abilities and
                          Estimated True Scores

                                   Winter                      Summer
Test         Estimate     20 Item  14 Item  12 Item   20 Item  14 Item  12 Item

Bayesian     Ability        .914     .919     .866      .963     .929     .905
Bayesian     True Score     .885     .900     .830      .946     .881     .855
Max. Like.   Ability        .925     .865     .943      .908     .748     .777
Max. Like.   True Score     .899     .820     .936      .921     .875     .8^9





The next analysis to be performed was the comparison of the test information
yielded by the two procedures at the 20 item level. Using the 20 item level was
deemed appropriate since the reliabilities for the different test lengths were
not significantly different, and as indicated before both tests appeared to be
yielding consistent ability estimates by the 14 item level. A three-way ANOVA
was performed over the data, using as independent variables test (maximum
likelihood vs. Bayesian), semester (winter vs. summer), and session, with
session being a repeated measure. The dependent variable was total test
information at the final ability estimate for the 20 item level. The results of
the ANOVA on the total test information at the 20 item level are shown in
Table 4.



                                  Table 4

             Results of the Three-Way ANOVA on the Total Test
              Information Yielded at the 20 Item Level Using
          Semester, Test, and Session as Independent Variables

Source                            SS       df        MS        F       p

Test                           494.71       1     494.71     6.63    0.012
Semester                       455.64       1     455.64     6.11    0.016
Test x Semester                  3.14       1       3.14     0.04    0.838
Error                         4997.47      67      74.59

Session                          8.04       1       8.04     2.22    0.141
Session x Test                   6.90       1       6.90     1.90    0.172
Session x Semester               0.33       1       0.33     0.09    0.764
Session x Test x Semester        4.77       1       4.77     1.32    0.255
Error                          243.05      67       3.63







As seen in the table, the test main effect was significant (F = 6.63, p < .05),
indicating that the two procedures were significantly different on the average
total test information. The mean information values presented in Table 5 show
that the Bayesian procedure yielded more total test information than did the
maximum likelihood procedure. The only other significant effect was the semester
main effect (F = 6.11, p < .05). This was not surprising, as earlier results
showed that the ability estimates from the two procedures were different for the
two semesters. Since the summer semester yielded higher ability estimates, this
would have resulted in items with greater b-values being selected for the
subjects during the summer session. Because there are fewer optimal items
available at the extremes of the item pool, this resulted in items being
selected that yielded less than optimal item information. Since the total test
information is contingent upon item information, this would result in lower test
information during the summer. This can be seen in Table 5.
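The dependence of total test information on the availability of well-matched
items can be illustrated with a short sketch. It assumes a three-parameter
logistic model with scaling constant 1.7, and the item pool is hypothetical: it
is dense near the middle of the ability scale and sparse at the upper extreme.

```python
import math

def item_information(theta, a, b, c):
    """Item information at theta under an assumed 3PL model (D = 1.7)."""
    p = c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def best_test_information(theta, pool, n_items):
    """Total information at theta from the n_items most informative items."""
    infos = sorted((item_information(theta, *item) for item in pool), reverse=True)
    return sum(infos[:n_items])

# Hypothetical pool: five items between b = -1 and b = 1, one item at b = 2.
pool = [(1.0, b, 0.2) for b in (-1.0, -0.5, 0.0, 0.5, 1.0, 2.0)]
info_middle = best_test_information(0.0, pool, 3)
info_extreme = best_test_information(2.5, pool, 3)
print(round(info_middle, 3), round(info_extreme, 3))
```

In this sketch an examinee near the middle of the scale can be given three
well-matched items, whereas an examinee at the upper extreme must be given
items that are too easy, so the total information obtained is lower.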



                                  Table 5

            Mean Total Test Information for the Bayesian and
           Maximum Likelihood Tailored Test Procedures for the
                       Winter and Summer Semester

                              Bayesian                Maximum Likelihood
Semester   Session    20 Item  14 Item  12 Item   20 Item  14 Item  12 Item

Winter        1        40.89    30.83    26.62     38.20    27.89    24.64
Winter        2        41.33    31.61    27.61     36.98    27.60    23.98
Summer        1        38.00    29.35    26.13     33.95    25.84    22.56
Summer        2        37.49    29.09    25.67     33.29    24.79    21.62



After comparing the total test information yielded by the maximum likelihood
tailored testing procedure and the Bayesian tailored testing procedure, the next
analysis was the comparison of the ability estimates yielded by the two
procedures. To make this comparison a four-way ANOVA was used in order to
examine the effect of the test length on the ability estimate as well as the
effect of the two different tailored testing procedures. The independent
variables were test (maximum likelihood vs. Bayesian), length (20 items, 14
items, and 12 items), semester (winter vs. summer), and session, with session
and length being repeated measures. The dependent variable was the ability
estimates from the 12 item, 14 item, and 20 item levels. It was expected from
prior results that there would be a semester main effect as well as a session
main effect. The results shown in Table 6 indicate that both the semester main
effect (F = 8.33, p < .05) and the session main effect (F = 7.50, p < .05) were
significant. The session main effect was probably due to practice.

It is also clear from Table 6 that the test main effect was significant (F =
15.43, p < .05), indicating a significant difference between the ability
estimates yielded by the two different tailored testing procedures. An
examination of Table 7 indicates that the maximum likelihood procedure yielded
greater ability estimates for both semesters, across sessions, and for all three
different test lengths. Also important to note is the lack of a main effect for
test length (F = 0.61, p > .05). This indicates that the mean ability estimates
at the different lengths were not significantly different from one another.
There was an interaction of test length and test (F = 6.39, p < .05). This
interaction is explainable by the ability estimates from the maximum likelihood
tailored testing procedure staying relatively stable while the ability estimates
from the Bayesian tailored testing procedure changed with test length (see
Table 8).



                                  Table 6

                  Results of the Four-Way ANOVA on the
                   Ability Estimates From the Maximum
                  Likelihood and Bayesian Tailored Test
              Procedures at the Three Different Test Lengths

Source                                        F         p

Semester                                     8.33     0.005
Test                                        15.43     0.000
Semester x Test                              0.18     0.675
Error

Session                                      7.50
Session x Semester                           0.00
Session x Test                               0.15
Session x Semester x Test                    0.97
Error

Length                                       0.61
Length x Semester                            0.11
Length x Test                                6.39
Length x Semester x Test                     0.72
Error

Session x Length                             1.63     0.199
Session x Length x Semester                  0.32     0.728
Session x Length x Test                      0.46     0.630
Session x Length x Semester x Test           1.43     0.242
Error






                                  Table 7

            Mean Ability Estimates for the Maximum Likelihood
            and Bayesian Ability Estimation Procedures at the
                       12, 14, and 20 Item Levels

                                  Mean Ability Estimates

                       Maximum Likelihood              Bayesian
Semester   Item     Session 1   Session 2     Session 1   Session 2

Winter      12        1.32        1.36          0.67        0.85
            14        1.2S        1.33          0.69        0.98
            20        1.25        1.30          0.78        0.89

Summer      12        1.53        1.68          1.07        1.15
            14        1.50        1.68          1.12        1.78
            20        1.53        1.55          1.18        1.26



Since the ability estimates yielded by the two different procedures were
significantly different, it seems likely that the items selected by the
procedures would be different. An inspection of a frequency count of item usage
indicated that the maximum likelihood procedure was utilizing items with higher
b-values than was the Bayesian procedure.



                                  Table 8

            Mean Ability Estimates for the Maximum Likelihood
                and Bayesian Ability Estimation Procedures
                   Combined over Semesters and Sessions

                      Ability Estimates
Item        Maximum Likelihood      Bayesian

12                1.48                .906
14                1.47                .914
20                                    .992



Summary and Conclusions

The overall purpose of this research was to compare a maximum likelihood
based tailored testing procedure to a Bayesian tailored testing procedure. The
results indicated that both tailored testing procedures produced equally
reliable ability estimates. Also, an analysis of test length indicated that
reasonable ability estimates could be obtained using 12 to 14 items.

It was also seen in the results that the maximum likelihood tailored testing
procedure yielded significantly less total test information than did the
Bayesian tailored testing procedure. This seemed to be a result of the fact that
the maximum likelihood procedure yielded significantly higher ability estimates,
thus utilizing from the item pool items with greater b-values. At the extremes
of the item pool there were fewer optimal items from which to choose. This
problem might have been alleviated had the item pool had more items with greater
b-values.

The major difference between the two procedures seems to be in the
significantly different ability estimates that they yielded. An examination of
the ability estimation procedure used by the two procedures explains why this
discrepancy exists. The Bayesian tailored testing procedure performs its ability
estimation on the basis of the prior ability distribution. This results in a
regression towards the prior mean for this procedure's ability estimates. Since
in this study the initial ability distribution had a mean well below the
population ability level, the result was an inhibiting effect on the final
ability estimates. It was predicted that had the mean of the prior ability
distribution been greater than the population value, the final ability estimates
would have been greater. This prediction was borne out when the ability
estimates for the Bayesian group were recalculated using a prior mean ability
estimate of 2.00. The recalculated ability estimates were significantly higher
(x̄ = 1.06, t = 4.34, p < .05). This result points out the importance of the
prior to the Bayesian procedure. An inaccurate prior can affect the ability
estimates. Since knowledge of the prior is often not available, this procedure
could result in biased estimates of ability. It thus seems that the maximum
likelihood procedure is the procedure of choice if an adequate prior
distribution is not available.
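The regression toward the prior mean can be demonstrated with a small sketch.
It computes a posterior mean by numerical integration over a grid, which is a
substitute for, and not a reproduction of, Owen's (1975) updating procedure.
The three-parameter logistic model, the item parameters, and the response
pattern are all hypothetical.

```python
import math

def likelihood(theta, items, responses):
    """Likelihood of a 0/1 response pattern under an assumed 3PL model."""
    value = 1.0
    for (a, b, c), u in zip(items, responses):
        p = c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))
        value *= p if u == 1 else (1.0 - p)
    return value

def posterior_mean(items, responses, prior_mean, prior_sd=1.0):
    """Posterior mean of ability with a normal prior, by grid integration."""
    grid = [-4.0 + 0.05 * k for k in range(201)]  # covers -4.0 to 6.0
    weights = [
        math.exp(-0.5 * ((t - prior_mean) / prior_sd) ** 2)
        * likelihood(t, items, responses)
        for t in grid
    ]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

# Hypothetical items (a, b, c) and responses; not taken from the study.
items = [(1.0, 0.0, 0.2), (1.1, 0.8, 0.2), (0.9, 1.5, 0.2), (1.2, 2.0, 0.2)]
responses = [1, 1, 1, 0]
low_prior = posterior_mean(items, responses, prior_mean=-1.0)
high_prior = posterior_mean(items, responses, prior_mean=2.0)
print(round(low_prior, 2), round(high_prior, 2))
```

The same response pattern yields a lower estimate under the lower prior mean,
which mirrors the inhibiting effect described above when the prior mean lies
below the population ability level.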